Problem Statement


• Tiler GPUs reduce memory bandwidth requirements by
  rendering per tile, keeping MRT/color and
  depth/stencil data in a small internal tile buffer


• But many anti-patterns exist in GL programs that cause
  unnecessary flush/restore
  - Unnecessary FBO switches
  - Mid-frame texture uploads or UBO updates
• With some driver cleverness we can reduce this
  - Batch reordering (aka job reshuffling)
  - Resource shadowing (aka ghosting)


Example super-awesome FPS 
game: triangle-quad 
But...


• This is a super-modern game that uses a UBO to
  pass the color to the FS
  - Mid-frame UBO update to change the color


Similar scenario for mid-frame texture uploads
  - but this was an easier example to draw


Typically a non-tiler GPU driver would use a
staging buffer to upload new data to the modified
buffer


[Figure: traditional GPU, showing the UBO and the staging buffer]
But...


This doesn't work so well for a tiler GPU


[Figure: tiler GPU, showing the staging buffer and the sequence Clear, Draw Quad, Staging->UBO, Draw Tri]


Naive/Previous Solution...


• Flush on mid-frame resource (UBO/texture/etc) update
• But this is expensive
  - RGBA8 @ 1080p => 8MB
  - Z24S8 @ 1080p => 8MB
  - MRT and/or higher-bpp formats (float16/float32) increase this
    proportionally
• Each unnecessary flush has a corresponding restore
  - To move data back into the tile buffer
  - So simple RGBA8 + Z24S8 => each extra flush costs 16MB of write
    bandwidth for the flush, and 16MB of read bandwidth for the restore
  - With MRT (multiple render targets) and/or “exotic” formats this goes up
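
The 8MB and 16MB figures follow from width x height x bytes per pixel. A minimal sketch of that arithmetic, where the helper names `attachment_bytes()` and `flush_bytes()` are made up for illustration and are not driver functions:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to hold one full-resolution attachment. */
static uint64_t attachment_bytes(uint32_t width, uint32_t height,
                                 uint32_t bytes_per_pixel)
{
    assert(bytes_per_pixel > 0);
    return (uint64_t)width * height * bytes_per_pixel;
}

/* Bytes written by one flush of every attachment. The matching
 * restore reads the same amount back into the tile buffer. */
static uint64_t flush_bytes(uint32_t width, uint32_t height,
                            const uint32_t *bytes_per_pixel, unsigned count)
{
    uint64_t total = 0;
    for (unsigned i = 0; i < count; i++)
        total += attachment_bytes(width, height, bytes_per_pixel[i]);
    return total;
}
```

With 4 bytes per pixel for both RGBA8 and Z24S8 at 1920x1080, each attachment is 8,294,400 bytes (roughly 8MB) and the pair is roughly 16MB in each direction.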


So... to the dirty tricks 


We need to shadow resources
  - Buffers: UBOs, textures, etc
Re-order rendering in case of FBO switches
  - This includes internally generated u_blitter work such as resource-shadowing
    back-blits and mipmap level generation


These two tricks are related
  - We don't have a separate DMA pipe for blits / mipmap generation / etc
  - u_blitter: everything looks like an FBO switch!

Fortunately, solving it this way handles FBO switches too
  - vs. special-casing blits


But how to implement? (1) 


• Split out “batch” object
  - vc4 calls this a “job”
    • Basically a “tile pass”
  - Tracks command-stream and all state related to gmem/tile pass
    • Which render target buffers (MRT & z/s) are cleared
    • Stats which we use to decide about tiling/gmem vs bypass
    • Accumulated scissor (lets us skip many tiles for UI type workloads)
    • Patch-lists
    • Query result BOs
  - Some tiler GPUs handle this more automatically
    • But adreno requires the driver to handle the tiling itself, via explicit cmdstream
      for restore and resolve
    • So all this state must move from context -> batch so that it is still around / valid later
      when we flush and construct the gmem/tiling cmdstream
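
As one illustration of per-batch state, the accumulated scissor can be kept as the union of every draw's scissor and used at flush time to count the tiles that need processing. This is only a sketch: `struct rect`, the function names, and the tile size passed in are assumptions, not the driver's actual types.

```c
#include <assert.h>

/* Hypothetical rectangle type; max coordinates are exclusive. */
struct rect {
    unsigned minx, miny, maxx, maxy;
};

/* Start with an empty (inverted) rectangle so the first draw defines it. */
static void scissor_reset(struct rect *acc)
{
    acc->minx = acc->miny = ~0u;
    acc->maxx = acc->maxy = 0;
}

/* Grow the batch's accumulated scissor to cover one more draw. */
static void scissor_accumulate(struct rect *acc, const struct rect *draw)
{
    if (draw->minx < acc->minx) acc->minx = draw->minx;
    if (draw->miny < acc->miny) acc->miny = draw->miny;
    if (draw->maxx > acc->maxx) acc->maxx = draw->maxx;
    if (draw->maxy > acc->maxy) acc->maxy = draw->maxy;
}

/* Number of tiles the accumulated scissor overlaps. */
static unsigned tiles_touched(const struct rect *acc,
                              unsigned tile_w, unsigned tile_h)
{
    assert(tile_w > 0 && tile_h > 0);
    if (acc->maxx <= acc->minx || acc->maxy <= acc->miny)
        return 0; /* nothing was drawn */
    unsigned tx0 = acc->minx / tile_w, tx1 = (acc->maxx - 1) / tile_w;
    unsigned ty0 = acc->miny / tile_h, ty1 = (acc->maxy - 1) / tile_h;
    return (tx1 - tx0 + 1) * (ty1 - ty0 + 1);
}
```

Because the rectangle lives in the batch rather than the context, it is still valid when the batch is finally flushed, possibly long after the application has switched to another FBO.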


But how to implement? (2) 


• Batch Cache
  - Construct a hash table key from
    pipe_framebuffer_state
  - Can't use pfb as-is because of transient pipe_surface
    ptrs
• On FBO switch (ctx->set_framebuffer_state())
  - Hashtable lookup to find existing unflushed batch
  - Otherwise create new batch and add to hash table
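
A rough sketch of the lookup-or-create step. Everything here is hypothetical: `struct fb_key`, the per-resource ids that stand in for the transient surface pointers, the FNV-1a hash, and the small open-addressed table are illustration only and not the driver's actual data structures.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_COLOR_BUFS 8
#define TABLE_SIZE 64 /* power of two, larger than the batch limit */

/* Hypothetical key built from stable properties of the framebuffer.
 * Only uint32_t fields, so there is no padding; callers must zero
 * the unused cbuf_id[] entries. */
struct fb_key {
    uint32_t width, height;
    uint32_t num_cbufs;
    uint32_t cbuf_id[MAX_COLOR_BUFS]; /* id of the resource behind each surface */
    uint32_t zsbuf_id;                /* 0 means no depth/stencil */
};

struct batch {
    struct fb_key key;
    int in_use;
    unsigned num_draws;
};

struct batch_cache {
    struct batch slots[TABLE_SIZE];
};

/* FNV-1a over the raw key bytes. */
static uint32_t key_hash(const struct fb_key *key)
{
    const unsigned char *p = (const unsigned char *)key;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < sizeof(*key); i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Return the unflushed batch for this framebuffer, creating one on a miss. */
static struct batch *batch_from_key(struct batch_cache *cache,
                                    const struct fb_key *key)
{
    uint32_t start = key_hash(key) & (TABLE_SIZE - 1);
    for (unsigned probe = 0; probe < TABLE_SIZE; probe++) {
        struct batch *b = &cache->slots[(start + probe) & (TABLE_SIZE - 1)];
        if (!b->in_use) {            /* miss: claim the free slot */
            b->key = *key;
            b->in_use = 1;
            b->num_draws = 0;
            return b;
        }
        if (memcmp(&b->key, key, sizeof(*key)) == 0)
            return b;                /* hit: keep appending to this batch */
    }
    assert(!"batch cache full");
    return NULL;
}
```

Switching back to a framebuffer that still has an unflushed batch returns that same batch, so its draws keep accumulating rather than forcing a flush/restore. The sketch omits removal of flushed batches and any cap on the number of live batches.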


But how to implement? (3) 


• A bunch of dependency tracking
  - We need to track per resource:
    • N batches that read a resource
    • 1 batch that writes a resource
  - Per draw, look at dependencies of read and written resources
    • Textures, UBOs, VBOs, TF stream-out buffers, query result buffers, etc
    • Resources written by draw => dependency on other batches that read or write
    • Resources read by draw => dependency on batches that write
    • Need to ensure batches are executed in correct order
  - i.e. the batch that writes a resource must run before the one that reads it
    • For example, the batch that writes a TF stream-out buffer must run before the batch that uses it as a VBO
    • Or the batch that writes an MRT buffer must run before the batch that uses it as a texture
  - And the batch that overwrites a resource must run after any that read the previous version
  - So, per batch, track the N dependent batches
    • Also needed to ensure the correct batches are flushed before a transfer_map(READ) or
      transfer_map(WRITE)
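
The ordering rules above can be sketched as follows. The struct layouts and function names are invented for illustration, and plain 32-bit masks stand in for the sets of batches, which assumes at most 32 live batches.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BATCHES 32

struct batch {
    unsigned idx;  /* slot number, 0..31 */
    uint32_t deps; /* batches that must run before this one */
};

struct resource {
    uint32_t readers; /* batches with pending reads */
    int writer;       /* idx of the batch with a pending write, or -1 */
};

/* A draw in batch b reads r: b must run after r's pending writer. */
static void batch_reads(struct batch *b, struct resource *r)
{
    assert(b->idx < MAX_BATCHES);
    if (r->writer >= 0 && (unsigned)r->writer != b->idx)
        b->deps |= 1u << r->writer;
    r->readers |= 1u << b->idx;
}

/* A draw in batch b writes r: b must run after every other batch
 * that reads or writes the previous contents. */
static void batch_writes(struct batch *b, struct resource *r)
{
    assert(b->idx < MAX_BATCHES);
    uint32_t self = 1u << b->idx;
    b->deps |= r->readers & ~self;
    if (r->writer >= 0 && (unsigned)r->writer != b->idx)
        b->deps |= 1u << r->writer;
    r->writer = (int)b->idx;
}

/* Flush a batch after flushing its dependencies, recording the order. */
static void flush_batch(struct batch *batches, unsigned idx,
                        uint32_t *flushed, unsigned *order, unsigned *count)
{
    if (*flushed & (1u << idx))
        return;
    *flushed |= 1u << idx; /* mark first so a cycle cannot recurse forever */
    for (unsigned d = 0; d < MAX_BATCHES; d++)
        if (batches[idx].deps & (1u << d))
            flush_batch(batches, d, flushed, order, count);
    order[(*count)++] = idx;
}
```

For example, if batch 0 renders to a texture, batch 1 samples it, and batch 2 then overwrites it, flushing batch 2 runs batches 0, 1, 2 in that order.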


First try.. 


Track per pipe_resource
  - last read batch
  - write batch

Track single dependent batch per batch

Low overhead (avoids hash set per BO)

But introduces too many artificial dependencies


Solving dependency tracking properly.. 


• Per batch
  - hash set of dependent batches
  - hash set of used (read/write) resources
• Per resource
  - hash set of batches that read the resource
  - single batch that writes the resource
• Hash sets are O(1), but a big O(1), with lots of extra memory
  allocations
  - You can have 100s of resources (or more) involved in rendering a
    frame
  - And many 100s to 1000s of draws.. so overhead adds up


But, 32 batches should be enough 


• We want to limit unflushed batches anyway, e.g. during
  game/level startup while textures are being uploaded
• And it is enough for 2x mipmap gen for the largest
  possible texture
  - Normal u_blitter batches are flushed immediately
    • so never come close to the upper limit of 32
  - But needed transiently for back-blits
• This turns every hash set of batches into a 32-bit
  bitmask!


Nice things about bitmasks.. 


• Hash set ops:
  - insert => |= (1 << batch->idx)
  - test => & (1 << batch->idx)
  - remove => &= ~(1 << batch->idx)
  - iterate => loop of ffs() (i.e. u_bit_scan())
• When you have many 100s of draws per batch,
  and up to 16 textures / 32 VBOs / N UBOs / TF
  stream-out BOs, query BOs, etc, it is nice to keep
  the overhead down
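
The four set operations written out as a small sketch. The function names are made up, and `find_first_set()` is a portable stand-in for ffs() so the example does not depend on any particular header.

```c
#include <assert.h>
#include <stdint.h>

/* insert: set the bit for this batch index */
static inline void set_insert(uint32_t *mask, unsigned idx)
{
    assert(idx < 32);
    *mask |= 1u << idx;
}

/* test: non-zero if the batch index is in the set */
static inline int set_test(uint32_t mask, unsigned idx)
{
    assert(idx < 32);
    return (mask & (1u << idx)) != 0;
}

/* remove: clear the bit for this batch index */
static inline void set_remove(uint32_t *mask, unsigned idx)
{
    assert(idx < 32);
    *mask &= ~(1u << idx);
}

/* Portable stand-in for ffs(): 1-based position of the lowest set
 * bit, or 0 when the mask is empty. */
static int find_first_set(uint32_t mask)
{
    for (int i = 0; i < 32; i++)
        if (mask & (1u << i))
            return i + 1;
    return 0;
}

/* iterate: write each member index to out[] in ascending order and
 * return how many there were. */
static unsigned set_collect(uint32_t mask, unsigned *out)
{
    unsigned n = 0;
    while (mask) {
        out[n++] = (unsigned)find_first_set(mask) - 1;
        mask &= mask - 1; /* clear the lowest set bit */
    }
    return n;
}
```

None of these allocate memory, and each is a handful of instructions, which is what matters when they run for every resource of every draw.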


So basically.. 


• All hash sets go away except batch->resources
  - Tests for inclusion guarded by & (1 << batch->idx)
• So only do hash set insert for resources that aren't
  already referenced by the batch
  - Probably could go away if we merged
    libdrm_freedreno and gallium
    • vc4 does something like this..
    • But that would mean throwing away kgsl and a2xx
      support
    • Probably worth doing eventually, but not yet
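
A sketch of the guarded insert. The field name `batch_mask`, the fixed-size array standing in for the batch->resources hash set, and the function name are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TRACKED 64

struct resource {
    uint32_t batch_mask; /* bit N set => batch N already references this */
};

struct batch {
    unsigned idx;                             /* slot number, 0..31 */
    struct resource *resources[MAX_TRACKED];  /* stand-in for the hash set */
    unsigned num_resources;
};

/* Called for every resource used by every draw. The cheap bit test
 * filters out the common case so the costly insert runs only the
 * first time a batch touches a resource. */
static void batch_reference_resource(struct batch *b, struct resource *r)
{
    assert(b->idx < 32);
    uint32_t bit = 1u << b->idx;
    if (r->batch_mask & bit)
        return; /* already referenced by this batch */
    assert(b->num_resources < MAX_TRACKED);
    b->resources[b->num_resources++] = r; /* the "hash set insert" */
    r->batch_mask |= bit;
}
```

With hundreds of draws per batch that reuse the same textures and buffers, almost every call takes the early return.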


Results 


• supertuxkart: +30%
  - new render engine has mid-frame UBO updates
• manhattan: +20%
  - mid-frame texture upload + generate-mipmap
• glmark2
  - desktop: +7%
  - shadow: +20%


